Abstract

The problem of stochastic convex optimization with bandit feedback (in the learning community) or without 
knowledge of gradients (in the optimization community) has received much attention in recent years, in the form of 
algorithms and performance upper bounds. However, much less is known about the inherent complexity of these 
problems, and there are few lower bounds in the literature, especially for nonlinear functions. In this paper, we
investigate the attainable error/regret in the bandit and derivative-free settings, as a function of the dimension d and
the available number of queries T. We provide a precise characterization of the attainable performance for
strongly-convex and smooth functions, which also implies a non-trivial lower bound for more general problems. Moreover, we
prove that in both the bandit and derivative-free settings, the required number of queries must scale at least
quadratically with the dimension. Finally, we show that on the natural class of quadratic functions, it is possible to obtain a
"fast" $O(1/T)$ error rate in terms of T, under mild assumptions, even without having access to gradients. To the best
of our knowledge, this is the first such rate in a derivative-free stochastic setting, and it holds despite previous results
which seem to imply the contrary.

1 Introduction 

This paper considers the following fundamental question: Given an unknown convex function F, and the ability to 
query for (possibly noisy) realizations of its values at various points, how can we optimize F with as few queries as 
possible? 

This question, under different guises, has played an important role in several communities. In the optimization 
community, this is usually known as "zeroth-order" or "derivative-free" convex optimization, since we only have 
access to function values rather than gradients or higher-order information. The goal is to return a point with small 
optimization error on some convex domain, using a limited number of queries. Derivative-free methods were among 
the earliest algorithms to numerically solve unconstrained optimization problems, and have recently enjoyed increasing 
interest, being especially useful in black-box situations where gradient information is hard to compute or does not exist
[18, 20]. In a stochastic framework, we can only obtain noisy realizations of the function values (for instance, due to
running the optimization process on sampled data). We refer to this setting as derivative-free SCO (short for stochastic 
convex optimization). 

In the learning community, these kinds of problems have been closely studied in the context of multi-armed bandits 
and (more generally) bandit online optimization, which are powerful models for sequential decision making under 
uncertainty 0G]. In a stochastic framework, these settings correspond to repeatedly choosing points in some convex 
domain, obtaining noisy realizations of some underlying convex function's value. However, rather than minimizing 
optimization error, our goal is to minimize the (average) regret: roughly speaking, that the average of the function 
values we obtain is not much larger than the minimal function value. For example, the well-known multi-armed bandit 
problem corresponds to a linear function over the simplex. We refer to this setting as bandit SCO. As will be more 
explicitly discussed later on, any algorithm which attains small average regret can be converted to an algorithm with 
the same optimization error. In other words, bandit SCO is only harder than derivative-free SCO. 

When one is given gradient information, the attainable optimization error / average regret is well-known: under 
mild conditions, it is $O(1/\sqrt{T})$ for convex functions and $O(1/T)$ for strongly-convex functions, where T is the
number of queries [21, 14, 19]. Note that these bounds do not explicitly depend on the dimension of the domain.

The inherent complexity of bandit/derivative-free SCO is not as well-understood. An important exception is
multi-armed bandits, where the attainable error/regret is known to be exactly $\Theta(\sqrt{d/T})$, where d is the dimension and T
is the number of queries¹ [6, 5]. Linear functions over other convex domains have also been explored, with upper
bounds on the order of $O(\sqrt{d/T})$ to $O(\sqrt{d^2/T})$ (e.g. E0). For linear functions over general domains,
information-theoretic $\Omega(\sqrt{d^2/T})$ lower bounds on the regret have been proven in [11, 12]. However, we emphasize that these lower
bounds are on regret, not optimization error, and were shown on non-convex domains. This falls outside the scope of
stochastic convex optimization we consider here, where the convexity generally ensures computational tractability.

When dealing with more general, non-linear functions, much less is known. The problem was originally considered 
over 30 years ago, in the seminal work by Yudin and Nemirovsky on the complexity of optimization [17]. The
authors provided some algorithms and upper bounds, but as they themselves emphasize (cf. pg. 359), the attainable
complexity is far from clear. Quite recently, [15] provided an $\Omega(\sqrt{d/T})$ lower bound for strongly-convex functions,
which demonstrates that the "fast" $O(1/T)$ rate in terms of T, that one enjoys with gradient information, is not
possible here. In contrast, the current best-known upper bounds are $O(\sqrt[4]{d^2/T})$, $O(\sqrt[3]{d^2/T})$, $O(\sqrt{d^2/T})$ for convex,
strongly-convex, and strongly-convex-and-smooth functions respectively [13, 2]; and an $O(\sqrt{d^{32}/T})$ bound for convex
functions [3], which is better in terms of dependence on T but very bad in terms of the dimension d.

In this paper, we investigate the complexity of bandit and derivative-free stochastic convex optimization, focusing 
on nonlinear functions, with the following contributions (see also the summary in Table 1):

• We prove that for strongly-convex and smooth functions, the attainable error/regret is exactly $\Theta(\sqrt{d^2/T})$. This
has three important ramifications: First of all, it settles the question of attainable performance for such functions,
and is the first sharp characterization of complexity for a general nonlinear bandit/derivative-free class of
problems. Second, it proves that the required number of queries T in such problems must scale quadratically with the
dimension, even in the easier derivative-free setting, and in contrast to the linear case which often allows linear
scaling with the dimension. Third, it formally provides an $\Omega(\sqrt{d^2/T})$ lower bound for more general classes of
convex problems, which is stronger than the $\Omega(\sqrt{d/T})$ lower bound known so far, e.g. through multi-armed
bandits.

• We analyze an important special case of strongly-convex and smooth functions, namely quadratic functions. We 
show that for such functions, one can (efficiently) attain $\Theta(d^2/T)$ optimization error, and that this rate is sharp.
To the best of our knowledge, it is the first general class of nonlinear functions for which one can show a "fast
rate" (in terms of T) in a derivative-free stochastic setting. In fact, this may seem to contradict the result in
[15], which shows an $\Omega(\sqrt{d/T})$ lower bound on quadratic functions. However, as we explain in more detail
later on, there is no contradiction, since the example establishing the lower bound of [15] imposes an extremely
small domain (which actually decays with T). Our $\Theta(d^2/T)$ result holds for a fixed domain, and under the mild
assumption that either the minimum of the quadratic function is bounded away from the domain boundary, or
that we can query points slightly outside the domain. Although this result is tight, we also show that under more
restrictive assumptions on the noise process, it is sometimes possible to obtain better error bounds, as good as
$O(d/T)$.

• We prove that even for quadratic functions, the attainable average regret is exactly $\Theta(\sqrt{d^2/T})$, in contrast to
the $\Theta(d^2/T)$ result for optimization error. This shows there is a real gap between what can be obtained for
derivative-free SCO and bandit SCO, at least for such functions. Again, this stands in contrast to previously
studied settings such as multi-armed bandits, where there is no difference in performance. 

The paper is structured as follows: In Sec. 2, we formally define the setup and introduce the notation we shall use in
the remainder of the paper. For clarity of exposition, we begin with the case of quadratic functions in Sec. 3, providing
algorithms, upper and lower bounds. The tools and insights we develop for the quadratic case will allow us to tackle

¹ In a stochastic setting, a more common bound in the literature is $O(d\log(T)/T)$, but the O-notation hides a non-trivial dependence on the
form of the underlying linear function (in multi-armed bandits terminology, a gap between the expected rewards bounded away from 0). Such
assumptions are not natural in a nonlinear bandit SCO setup, and without them, the regret is indeed $\Theta(\sqrt{d/T})$. See for instance [7, Chapter 2] for
more details.
Function Type | Derivative-Free SCO: Upper Bound | Derivative-Free SCO: Lower Bound | Bandit SCO: Upper Bound | Bandit SCO: Lower Bound
Quadratic | $d^2/T$ * | $d^2/T$ * | $\sqrt{d^2/T}$ | $\sqrt{d^2/T}$ *
Str. Convex and Smooth | $\sqrt{d^2/T}$ | $\sqrt{d^2/T}$ * | $\sqrt{d^2/T}$ | $\sqrt{d^2/T}$ *
Str. Convex | $\min\{\sqrt[3]{d^2/T}, \sqrt{d^{32}/T}\}$ | $\sqrt{d^2/T}$ * | $\min\{\sqrt[3]{d^2/T}, \sqrt{d^{32}/T}\}$ | $\sqrt{d^2/T}$ *
Convex | $\min\{\sqrt[4]{d^2/T}, \sqrt{d^{32}/T}\}$ | $\sqrt{d^2/T}$ * | $\min\{\sqrt[4]{d^2/T}, \sqrt{d^{32}/T}\}$ | $\sqrt{d^2/T}$ *

Table 1: A summary of the complexity results for derivative-free stochastic convex optimization (optimization error)
and bandit stochastic convex optimization (average regret), for various function classes and in terms of the dimension
d and the number of queries T. The boxed results (marked with * in this rendering) are shown in this paper. The upper bounds for the convex and
strongly convex case combine results from [13, 2, 3]. The table shows dependence on d, T only and ignores other
factors and constants.



the more general strongly-convex-and-smooth setting in Sec. 4. We end the main part of the paper with a summary
and discussion of open problems in Sec. 5. In Appendix A, we show how one can obtain improved performance in the
quadratic case, if we consider more specific noise processes. Some additional technical proofs are presented in
Appendix B.

2 Preliminaries 

Let $\|\cdot\|$ denote the standard Euclidean norm. We let $F(\cdot) : \mathcal{W} \to \mathbb{R}$ denote the convex function of interest, where
$\mathcal{W} \subseteq \mathbb{R}^d$ is a (closed) convex domain. We say that F is $\lambda$-strongly convex, for $\lambda > 0$, if for any $w, w' \in \mathcal{W}$ and any
subgradient g of F at w, it holds that $F(w') \ge F(w) + \langle g, w' - w\rangle + \frac{\lambda}{2}\|w' - w\|^2$. Intuitively, this means that
we can lower bound F everywhere by a quadratic function of fixed curvature. We say that F is $\mu$-smooth if for any
$w, w' \in \mathcal{W}$, and any subgradient g of F at w, it holds that $F(w') \le F(w) + \langle g, w' - w\rangle + \frac{\mu}{2}\|w' - w\|^2$. Intuitively,
this means that we can upper-bound F everywhere by a quadratic function of fixed curvature. We let $w^* \in \mathcal{W}$ denote
a minimizer of F on $\mathcal{W}$. To prevent trivialities, we consider in this paper only functions whose optimum $w^*$ is known
beforehand to lie in some bounded domain (even if $\mathcal{W}$ is large or all of $\mathbb{R}^d$), and the function is Lipschitz in that
domain.

The learning/optimization process proceeds in T rounds. Each round t, we pick and query a point $w_t \in \mathcal{W}$,
obtaining an independent realization of $F(w) + \xi_w$, where $\xi_w$ is an unknown zero-mean random variable, such that²
$\mathbb{E}[\xi_w^2] \le \max\{1, \|w\|^4\}$. In the bandit SCO setting, our goal is to minimize the expected average regret, namely

$$\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T} F(w_t) - F(w^*)\right],$$

whereas in the derivative-free SCO setting, our goal is to compute, based on $w_1, \ldots, w_T$ and the observed values,
some point $\bar{w} \in \mathcal{W}$, such that the expected optimization error

$$\mathbb{E}\left[F(\bar{w}) - F(w^*)\right]$$

is as small as possible. We note that given a bandit SCO algorithm with some regret bound, one can get a
derivative-free SCO algorithm with the same optimization error bound: we simply run the stochastic bandit algorithm, getting
² We note that this slightly deviates from the more common assumption in the bandits/derivative-free SCO setting that $\mathbb{E}[\xi_w^2] \le O(1)$. While
such assumptions are equivalent for bounded $\mathcal{W}$, we also wish to consider cases with unrestricted domains $\mathcal{W} = \mathbb{R}^d$. In that case, assuming
$\mathbb{E}[\xi_w^2] \le O(1)$ may lead to trivialities in the derivative-free setting. For example, consider the case where $F(w) = w^\top A w + b^\top w$. Then for
any w and any $\xi_w$ with uniformly bounded variance, we can get a virtually noiseless estimate of $w^\top A w$ by picking $w' = cw$ for some large c
and computing $\frac{1}{c^2}(F(w') + \xi_{w'})$. Variants of this idea will also allow virtually noiseless estimates of the linear term.
$w_1, \ldots, w_T$, and returning $\frac{1}{T}\sum_{t=1}^{T} w_t$. By Jensen's inequality, the expected optimization error is at most the expected
average regret with respect to $w_1, \ldots, w_T$. Thus, bandit SCO is only harder than derivative-free SCO.
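
The regret-to-error conversion can be illustrated numerically. The following is a minimal sketch, assuming a hypothetical convex test function `F` (a quadratic of the kind studied later in the paper) and an arbitrary list of query points standing in for the output of a bandit algorithm; it only checks the Jensen step, not any particular algorithm.

```python
def average_point(points):
    """Return the coordinate-wise average of a list of equal-length points."""
    T, d = len(points), len(points[0])
    return [sum(p[i] for p in points) / T for i in range(d)]


def F(w, e=(0.3, -0.2)):
    """Hypothetical convex objective 0.5*||w||^2 - <e, w>, minimized at w = e."""
    return 0.5 * sum(x * x for x in w) - sum(a * x for a, x in zip(e, w))


def regret_and_error(points, minimizer=(0.3, -0.2)):
    """Average regret of the queried points and optimization error of their average."""
    f_star = F(minimizer)
    avg_regret = sum(F(p) for p in points) / len(points) - f_star
    opt_error = F(average_point(points)) - f_star
    return avg_regret, opt_error


# Arbitrary query points (an assumption made for illustration only).
queries = [(1.0, 0.0), (0.0, 1.0), (0.5, -0.5), (0.2, -0.1)]
avg_regret, opt_error = regret_and_error(queries)
print(opt_error <= avg_regret)
```

Because F is convex, the value at the averaged point can never exceed the average of the values, which is exactly why a regret bound transfers to an optimization error bound.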

In this paper, we provide upper and lower bounds on the attainable optimization error / average regret, as a function 
of the dimension d and the number of rounds/queries T. For simplicity, we focus here on bounds which hold in 
expectation, and an interesting point for further research is to extend these to bounds on the actual error/regret, which 
hold with high probability. 

3 Quadratic Functions 

In this section, we consider the class of quadratic functions, which have the form 

$$F(w) = w^\top A w + b^\top w + c$$

where A is positive-definite (with a minimal eigenvalue bounded away from 0). Moreover, to make the problem
well-behaved, we assume that A has a spectral norm of at most 1, and that $\|b\| \le 1$, $|c| \le 1$. We note that if the norms
are bounded but larger than 1, this can be easily handled by rescaling the function. It is easily seen that such functions
are both strongly convex and smooth. Moreover, this is a natural and important class of functions, which in learning 
applications appears, for instance, in the context of least squares and ridge regression. Besides providing new insights 
for this class, we will use the techniques developed here later on, in the more general case of strongly-convex and 
smooth functions. 
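
To make the "easily seen" claim concrete, the following short derivation expands F around a point w. It is a sketch under the assumption that A is symmetric, so that $\nabla F(w) = 2Aw + b$; the symbols $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ (smallest and largest eigenvalues of A) are introduced here only for illustration.

```latex
\begin{align*}
F(w') &= F(w) + \langle \nabla F(w),\, w' - w\rangle + (w'-w)^\top A\,(w'-w), \\
\lambda_{\min}(A)\,\|w'-w\|^2 &\le (w'-w)^\top A\,(w'-w) \le \lambda_{\max}(A)\,\|w'-w\|^2 .
\end{align*}
```

Comparing with the definitions in the Preliminaries, F is $2\lambda_{\min}(A)$-strongly convex and $2\lambda_{\max}(A)$-smooth; in particular, it is 2-smooth whenever the spectral norm of A is at most 1.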

3.1 Upper Bounds 

We begin by showing that for derivative-free SCO, one can obtain an optimization error bound of $O(d^2/T)$. To the
best of our knowledge, this is the first example of a derivative-free stochastic bound scaling as $O(1/T)$ for a general
class of nonlinear functions, as opposed to $O(1/\sqrt{T})$. However, to achieve this result, we need to make the following
mild assumption:

Assumption 1. At least one of the following holds for some fixed $\epsilon \in (0, 1]$:

• The quadratic function attains its minimum $w^*$ in the domain $\mathcal{W}$, and the Euclidean distance of $w^*$ from the
domain boundary is at least $\epsilon$.

• We can query not just points in $\mathcal{W}$, but any point whose distance from $\mathcal{W}$ is at most $\epsilon$.

With strongly-convex functions, the most common case is that $\mathcal{W} = \mathbb{R}^d$, and then both cases actually hold for any
value of $\epsilon$. Even in other situations, one of these assumptions virtually always holds. Note that we crucially rely here
on the strong-convexity assumption: with (say) linear functions, the domain must always be bounded and the optimum 
always lies at the boundary of the domain. 

With this assumption, the bound we obtain is on the order of $d^2/(\epsilon^2 T)$. As discussed earlier, [15] recently proved an
$\Omega(\sqrt{d/T})$ lower bound for derivative-free SCO, which actually applies to quadratic functions. This does not contradict
our result, since in their example the diameter of $\mathcal{W}$ (and hence also $\epsilon$) decays with T. In contrast, our $O(d^2/T)$ bound
holds for fixed $\epsilon$, which we believe is natural in most applications.

To obtain this behavior, we utilize a well-known 1-point gradient estimate technique, which allows us to get an 
unbiased estimate of the gradient at any point by randomly querying for a (noisy) value of the function around it (see 
ifTTl FPU ). Our key insight is that whereas for general functions one must query very close to the point of interest 
(scaling to 0 with T), quadratic functions have additional structure which allows us to query relatively far away,
allowing gradient estimates with much smaller variance. 

The algorithm we use is presented as Algorithm 1, and is computationally efficient. It uses a modification $\tilde{\mathcal{W}}$ of
the domain $\mathcal{W}$, defined as follows. First, we let B denote some known upper bound on $\|w^*\|$. If the first alternative
of Assumption 1 holds, then $\tilde{\mathcal{W}}$ consists of all points in $\mathcal{W} \cap \{w : \|w\| \le B\}$, whose distance from $\mathcal{W}$'s boundary
is at least $\epsilon$. If the second alternative holds, then $\tilde{\mathcal{W}} = \mathcal{W} \cap \{w : \|w\| \le B\}$. Note that under any alternative, it
holds that $\tilde{\mathcal{W}}$ is convex, that $\|w_t\| \le B$, that $w^* \in \tilde{\mathcal{W}}$, and that our algorithm always queries at legitimate points. In
Algorithm 1 Derivative-Free SCO Algorithm for Strongly-Convex Quadratic Functions
Input: Strong convexity parameter $\lambda > 0$; Distance parameter $\epsilon \in (0, 1]$
Initialize $w_1 = 0$.
for $t = 1, \ldots, T-1$ do
    Pick $r \in \{-1, +1\}^d$ uniformly at random
    Query noisy function value v at point $w_t + \frac{\epsilon}{\sqrt{d}} r$
    Let $g = \frac{\sqrt{d}\, v}{\epsilon} r$
    Let $w_{t+1} = \Pi_{\tilde{\mathcal{W}}}\left(w_t - \frac{1}{\lambda t} g\right)$
end for
Return $\bar{w} = \frac{2}{T} \sum_{t=T/2}^{T} w_t$



the pseudocode, we use $\Pi_{\tilde{\mathcal{W}}}$ to denote projection on $\tilde{\mathcal{W}}$. For simplicity, we assume that T/2 is an integer and that $\tilde{\mathcal{W}}$
includes the origin 0.
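
The algorithm can be sketched in a few lines of code. The following is a minimal, hedged illustration rather than the paper's implementation: it assumes the second alternative of Assumption 1 with the feasible set taken to be a Euclidean ball of radius B, averages the last T/2 iterates, and uses a hypothetical test instance (`b`, `F`, the Gaussian noise level 0.5) chosen only for demonstration.

```python
import math
import random


def project_ball(w, radius):
    """Euclidean projection of w onto the ball of the given radius."""
    norm = math.sqrt(sum(x * x for x in w))
    if norm <= radius:
        return list(w)
    return [x * radius / norm for x in w]


def derivative_free_sgd(noisy_value, d, T, lam, eps, B, rng):
    """One-point gradient estimates with step size 1/(lam*t) and suffix averaging."""
    w = [0.0] * d                       # w_1 = 0
    half = T // 2
    suffix_sum = [0.0] * d
    for t in range(1, T + 1):
        if t > half:                    # w currently holds w_t; keep the last T/2 iterates
            suffix_sum = [s + x for s, x in zip(suffix_sum, w)]
        if t == T:
            break
        r = [rng.choice((-1.0, 1.0)) for _ in range(d)]
        query = [x + eps / math.sqrt(d) * ri for x, ri in zip(w, r)]
        v = noisy_value(query)          # single noisy function value
        g = [math.sqrt(d) * v / eps * ri for ri in r]
        w = project_ball([x - gi / (lam * t) for x, gi in zip(w, g)], B)
    return [s / (T - half) for s in suffix_sum]


# Hypothetical instance: F(w) = w^T A w + b^T w with A = 0.5*I, so F is 1-strongly convex.
b = [0.3, -0.2, 0.1]
w_star = [-x for x in b]                # the gradient 2*A*w + b = w + b vanishes here


def F(w):
    return 0.5 * sum(x * x for x in w) + sum(bi * x for bi, x in zip(b, w))


rng = random.Random(0)
w_bar = derivative_free_sgd(lambda q: F(q) + rng.gauss(0.0, 0.5),
                            d=3, T=20000, lam=1.0, eps=1.0, B=1.0, rng=rng)
error = F(w_bar) - F(w_star)
print(error)
```

Note that each round uses a single function evaluation, and that the query point may lie up to $\epsilon$ outside the ball, which is what the second alternative of Assumption 1 permits.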

The following theorem quantifies the optimization error of our algorithm. 

Theorem 1. Let $F(w) = w^\top A w + b^\top w + c$ be a $\lambda$-strongly convex function, where $\|A\|_2$, $\|b\|$, $|c|$ are all at most 1,
and suppose the optimum $w^*$ has a norm of at most B. Then under Assumption 1, the point $\bar{w}$ returned by Algorithm 1
satisfies

nFw-F(^]<^ 5l0gmB + 1)4d2 



Xe 2 



T 



Note that returning $\bar{w}$ as the average over the last T/2 iterates (as opposed to averaging over all iterates) is
necessary to avoid log(T) factors [19].

As an interesting side-note, we remark that a gradient-based approach seems essential to obtain $O(1/T)$ rates (in
terms of T). For example, a different family of derivative-free methods (see for instance [17, 3, 15]) is based on a type
of noisy binary search, where a few strategically selected points are repeatedly sampled in order to estimate which of
them has a larger/smaller function value. This is used to shrink the feasible region where the optimum $w^*$ might lie.
Since it is generally impossible to estimate the mean of noisy function values at a rate better than $O(1/\sqrt{T})$, it would
be surprising to get an optimization rate faster than $O(1/\sqrt{T})$ with such methods.

The proof of the theorem relies on the following key lemma: 



Lemma 1. For any $w_t$, we have that

$$\mathbb{E}_{r,\xi}[g] = \nabla F(w_t)$$

and

$$\mathbb{E}_{r,\xi}\left[\|g\|^2\right] \le \frac{4 d^2 (B+1)^4}{\epsilon^2}.$$



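Proof. By the way r is picked, we have that $\mathbb{E}_r[r_i r_j] = \mathbf{1}_{i=j}$ and that $\mathbb{E}_r[r_i r_j r_k] = 0$ for all i, j, k. Thus, letting $\mathbb{E}$
denote expectation w.r.t. r and the random function values, and writing w for $w_t$, we have

$$\mathbb{E}[g] = \mathbb{E}\left[\frac{\sqrt{d}\, v}{\epsilon} r\right]
= \mathbb{E}\left[\frac{\sqrt{d}}{\epsilon}\left(w^\top A w + b^\top w + c + \xi_{w + \frac{\epsilon}{\sqrt{d}} r} + \frac{\epsilon}{\sqrt{d}}\left(2 w^\top A + b^\top\right) r + \frac{\epsilon^2}{d}\, r^\top A r\right) r\right]$$

$$= \mathbb{E}\left[(2 w^\top A r)\, r + (b^\top r)\, r\right] = 2Aw + b = \nabla F(w),$$

where the terms of order zero and two in r contribute nothing, since the noise and r have zero mean and $\mathbb{E}_r[r_i r_j r_k] = 0$.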
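Also, by the assumptions on A, b, c and the assumptions on the noise $\xi_w$, we have

$$\mathbb{E}\|g\|^2 = \mathbb{E}\left\|\frac{\sqrt{d}\, v}{\epsilon} r\right\|^2 = \frac{d^2}{\epsilon^2}\,\mathbb{E}[v^2]
= \frac{d^2}{\epsilon^2}\,\mathbb{E}\left[\left(F\left(w_t + \tfrac{\epsilon}{\sqrt{d}} r\right) + \xi_{w_t + \frac{\epsilon}{\sqrt{d}} r}\right)^2\right]$$

$$\le \frac{2d^2}{\epsilon^2}\left(\sup_{w : \|w\| \le B + \epsilon} (F(w))^2 + \max\left\{1, \left\|w_t + \tfrac{\epsilon}{\sqrt{d}} r\right\|^4\right\}\right)$$

$$\le \frac{2d^2}{\epsilon^2}\left(\sup_{w : \|w\| \le B + \epsilon} \left(w^\top A w + b^\top w + c\right)^2 + (B+1)^4\right)$$

$$\le \frac{2d^2}{\epsilon^2}\left(\left((B+\epsilon)^2 + (B+\epsilon) + 1\right)^2 + (B+1)^4\right)$$

$$\le \frac{4d^2}{\epsilon^2}(B+1)^4$$

as required. □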



This lemma implies that Algorithm 1 essentially performs stochastic gradient descent over the strongly-convex
function F(w), where the gradient estimates are unbiased and with bounded second moments. The returned point
is a suffix-average of the last T/2 iterates. Using a convergence analysis for stochastic gradient descent with
suffix-averaging [19, Theorem 5], and plugging in the bounds of Lemma 1, we get Thm. 1.
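
The unbiasedness claim of Lemma 1 can be checked exactly on a small example by enumerating all $2^d$ sign vectors r. The sketch below assumes a hypothetical symmetric matrix `A`, vector `b` and point `w` chosen for illustration, and uses noiseless function values (zero-mean noise does not change the expectation).

```python
import itertools
import math


def quad(A, b, c, w):
    """Evaluate w^T A w + b^T w + c."""
    d = len(w)
    return (sum(w[i] * A[i][j] * w[j] for i in range(d) for j in range(d))
            + sum(b[i] * w[i] for i in range(d)) + c)


def mean_one_point_estimate(A, b, c, w, eps):
    """Average of g = (sqrt(d) * v / eps) * r over all r in {-1, +1}^d."""
    d = len(w)
    total = [0.0] * d
    for r in itertools.product((-1.0, 1.0), repeat=d):
        query = [w[i] + eps / math.sqrt(d) * r[i] for i in range(d)]
        v = quad(A, b, c, query)
        for i in range(d):
            total[i] += math.sqrt(d) * v / eps * r[i]
    return [x / 2 ** d for x in total]


def gradient(A, b, w):
    """Gradient (A + A^T) w + b, which equals 2 A w + b for symmetric A."""
    d = len(w)
    return [sum((A[i][j] + A[j][i]) * w[j] for j in range(d)) + b[i] for i in range(d)]


# Hypothetical test instance (illustrative values only).
A = [[0.5, 0.1, 0.0], [0.1, 0.4, -0.2], [0.0, -0.2, 0.3]]
b = [0.2, -0.4, 0.1]
w = [0.3, -0.7, 0.5]
estimate = mean_one_point_estimate(A, b, 0.25, w, eps=0.8)
exact = gradient(A, b, w)
print(max(abs(x - y) for x, y in zip(estimate, exact)))
```

The agreement holds for any query radius $\epsilon$, which reflects the key point of the section: for quadratics there is no bias penalty for querying far from $w_t$.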

3.2 Lower Bounds 

In this subsection, we prove that the upper bound obtained in Thm. 1 is essentially tight: namely, up to constants, the
worst-case error rate one can obtain for derivative-free SCO of quadratic functions is on the order of $d^2/T$. Besides showing
that the algorithm above is essentially optimal, it implies that even for extremely nice strongly-convex functions and 
domains, the number of queries required to reach some fixed accuracy scales quadratically with the dimension d. This 
stands in contrast to the case of linear functions, where the provable query complexity often scales linearly with d. 

Theorem 2. Let the number of rounds T be fixed. Then for any (possibly randomized) querying strategy, there exists
a quadratic function of the form $F(w) = \frac{1}{2}\|w\|^2 - \langle e, w\rangle$, which is minimized at e where $\|e\| \le 1$, such that the
resulting $\bar{w}$ satisfies

$$\mathbb{E}\left[F(\bar{w}) - F(w^*)\right] \ge 0.01 \min\left\{1, \frac{d^2}{T}\right\}.$$

Note that since $\|e\| \le 1$, we know in advance that the optimum must lie in the unit Euclidean ball. Despite this,
the lower bound holds even if we do not restrict at all the domain in which we are allowed to query - i.e., it can even
be all of $\mathbb{R}^d$.

Proof. The proof technique is inspired by a lower bound which appears in [4], in the different context of compressed
sensing.

We will assume without loss of generality that $T \ge d^2$. Indeed, if there was a querying strategy with expected
optimization error < 0.01 after some $T' < d^2$ rounds, then we could have obtained a < 0.01 optimization error for
$T = d^2$ as well, by simulating the querying strategy for T' rounds and then returning the result (without using the
surplus queries). Thus, we wish to show that

$$\mathbb{E}\left[F(\bar{w}) - F(w^*)\right] \ge 0.01\,\frac{d^2}{T}$$

for any $T \ge d^2$.
We will exhibit a distribution over quadratic functions F, such that in expectation over this distribution, any
querying strategy will attain $\Omega(d^2/T)$ optimization error. This implies that for any querying strategy, there exists
some deterministic F for which it will have this amount of error.

The functions we shall consider are

$$F_e(w) = \frac{1}{2}\|w\|^2 - \langle e, w\rangle,$$

where e is drawn uniformly from $\{-\mu, \mu\}^d$, with $\mu > 0$ being a parameter to be specified later. Moreover, we will
assume that the noise $\xi_w$ is a Gaussian random variable with zero mean and standard deviation $\max\{1, \|w\|^2\}$.



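By definition of 1-strong convexity, it is easy to verify that $F_e(w) - F_e(e) \ge \frac{1}{2}\|w - e\|^2$. Thus, the expected
optimization error (over the querying strategy) is at least

$$\mathbb{E}\left[F_e(\bar{w}) - F_e(e)\right] \ge \mathbb{E}\left[\frac{1}{2}\|\bar{w} - e\|^2\right] \ge \mathbb{E}\left[\frac{1}{2}\sum_{i=1}^{d}(\bar{w}_i - e_i)^2\right] \ge \mathbb{E}\left[\frac{\mu^2}{2}\sum_{i=1}^{d}\mathbf{1}_{\bar{w}_i e_i < 0}\right]. \quad (1)$$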



We will assume that the querying strategy is deterministic: $w_t$ is a deterministic function of the previous query values
$v_1, v_2, \ldots, v_{t-1}$ at $w_1, \ldots, w_{t-1}$. This assumption is without loss of generality, since any random querying strategy
can be seen as a randomization over deterministic querying strategies. Thus, a lower bound which holds uniformly for
any deterministic querying strategy would also hold over a randomization.

To lower bound Eq. (1), we use the following key lemma, which relates this to the question of how informative
are the query values (as measured by Kullback-Leibler or KL divergence) for determining the sign of e's coordinates.
Intuitively, the more similar the query values are, the smaller is the KL divergence and the harder it is to distinguish
the true sign of each $e_i$, leading to a larger lower bound. The proof appears in the appendix.

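Lemma 2. Let e be a random vector, none of whose coordinates is supported on 0, and let $v_1, v_2, \ldots, v_T$ be a
sequence of query values obtained by a deterministic strategy returning a point $\bar{w}$ (so that the query location $w_t$ is a
deterministic function of $v_1, \ldots, v_{t-1}$, and $\bar{w}$ is a deterministic function of $v_1, \ldots, v_T$). Then we have

$$\mathbb{E}\left[\sum_{i=1}^{d}\mathbf{1}_{\bar{w}_i e_i < 0}\right] \ge \frac{d}{2}\left(1 - \sqrt{\frac{1}{d}\sum_{i=1}^{d}\sum_{t=1}^{T} U_{t,i}}\right),$$

where

$$U_{t,i} = \sup_{\{e_j\}_{j \ne i}} D_{kl}\left(\Pr\left(v_t \mid e_i > 0, \{e_j\}_{j \ne i}, \{v_l\}_{l=1}^{t-1}\right) \,\Big\|\, \Pr\left(v_t \mid e_i < 0, \{e_j\}_{j \ne i}, \{v_l\}_{l=1}^{t-1}\right)\right)$$

and $D_{kl}$ represents the KL divergence between two distributions.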

Using Lemma 2, we can get a lower bound for the above, provided an upper bound on the $U_{t,i}$. To analyze this,
consider any fixed values of $\{e_j\}_{j \ne i}$, and any fixed values of $v_1, \ldots, v_{t-1}$. Since the querying strategy is assumed to
be deterministic, it follows that $w_t$ is uniquely determined. Given this $w_t$, the function value $v_t$ equals

$$F_e(w_t) + \xi_{w_t} = \frac{1}{2}\|w_t\|^2 - \sum_{j \ne i} e_j w_{t,j} - \mu w_{t,i} + \xi_{w_t} \quad (2)$$

conditioned on $e_i > 0$, and

$$F_e(w_t) + \xi_{w_t} = \frac{1}{2}\|w_t\|^2 - \sum_{j \ne i} e_j w_{t,j} + \mu w_{t,i} + \xi_{w_t} \quad (3)$$

conditioned on $e_i < 0$. Comparing Eq. (2) and Eq. (3), we notice that they both represent a Gaussian distribution
(due to the $\xi_{w_t}$ noise term), with standard deviation $\max\{1, \|w_t\|^2\}$ and means separated by $2\mu w_{t,i}$. To bound the
divergence, we use the following standard result on the KL divergence between two Gaussians [16]:
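Lemma 3. Let $\mathcal{N}(\mu, \sigma^2)$ represent a Gaussian distribution with mean $\mu$ and variance $\sigma^2$. Then

$$D_{kl}\left(\mathcal{N}(\mu_1, \sigma^2) \,\|\, \mathcal{N}(\mu_2, \sigma^2)\right) = \frac{(\mu_1 - \mu_2)^2}{2\sigma^2}.$$

Using this lemma, and writing P and Q for the conditional distributions of $v_t$ given $e_i > 0$ and $e_i < 0$ respectively, it follows that

$$D_{kl}\left(P(v_t \mid v_1, \ldots, v_{t-1}) \,\|\, Q(v_t \mid v_1, \ldots, v_{t-1})\right) \le \frac{(2\mu w_{t,i})^2}{2\max\{1, \|w_t\|^4\}} = \frac{2\mu^2 w_{t,i}^2}{\max\{1, \|w_t\|^4\}}.$$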
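
As a sanity check of Lemma 3 and of the resulting divergence bound, the sketch below compares the closed form with a numerical integral and evaluates the bound at a query point. The function names, the integration scheme and the sample values (`mu`, `w_t`) are assumptions made purely for illustration.

```python
import math


def kl_equal_variance(mu1, mu2, sigma):
    """Closed form of Lemma 3: KL(N(mu1, sigma^2) || N(mu2, sigma^2))."""
    return (mu1 - mu2) ** 2 / (2.0 * sigma ** 2)


def kl_numeric(mu1, mu2, sigma, steps=20000, width=12.0):
    """Midpoint-rule approximation of the integral of p * log(p / q)."""
    lo = mu1 - width * sigma
    h = 2.0 * width * sigma / steps
    log_norm = math.log(sigma * math.sqrt(2.0 * math.pi))
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h
        log_p = -((x - mu1) ** 2) / (2.0 * sigma ** 2) - log_norm
        log_q = -((x - mu2) ** 2) / (2.0 * sigma ** 2) - log_norm
        total += math.exp(log_p) * (log_p - log_q) * h
    return total


def divergence_bound(mu, w_t, i):
    """The bound 2*mu^2*w_{t,i}^2 / max{1, ||w_t||^4} for a query point w_t."""
    norm_sq = sum(x * x for x in w_t)
    return 2.0 * mu ** 2 * w_t[i] ** 2 / max(1.0, norm_sq ** 2)


mu, w_t, i = 0.05, [1.5, -0.5, 2.0], 0
sigma = max(1.0, sum(x * x for x in w_t))      # standard deviation max{1, ||w_t||^2}
closed = kl_equal_variance(-mu * w_t[i], mu * w_t[i], sigma)
print(closed, divergence_bound(mu, w_t, i), kl_numeric(0.3, -0.3, 1.5))
```

The bound shrinks as $\|w_t\|$ grows beyond 1, because the noise standard deviation grows like $\|w_t\|^2$ while the mean separation grows only linearly in $w_{t,i}$.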



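Plugging this upper bound on the $U_{t,i}$ in Lemma 2, we can further lower bound the expected optimization error
from Eq. (1) by

$$\frac{d\mu^2}{4}\left(1 - \sqrt{\frac{1}{d}\sum_{t=1}^{T}\frac{2\mu^2\|w_t\|^2}{\max\{1, \|w_t\|^4\}}}\right) \ge \frac{d\mu^2}{4}\left(1 - \sqrt{\frac{2T\mu^2}{d}}\right). \quad (4)$$

Finally, we choose $\mu = \sqrt{d/4T}$, and obtain a lower bound of $(1 - 1/\sqrt{2})\,d^2/16T \ge 0.01\,d^2/T$ as required. □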
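
The final arithmetic step can be spelled out as follows. This is a sketch that assumes the lower bound takes the form $\frac{d\mu^2}{4}\left(1 - \sqrt{2T\mu^2/d}\right)$ and uses the choice $\mu = \sqrt{d/4T}$ stated in the proof.

```latex
\begin{align*}
\mu^2 = \frac{d}{4T}
\;\Longrightarrow\;
\frac{d\mu^2}{4} &= \frac{d^2}{16T},
\qquad
\sqrt{\frac{2T\mu^2}{d}} = \sqrt{\frac{1}{2}} = \frac{1}{\sqrt{2}}, \\
\frac{d\mu^2}{4}\left(1 - \sqrt{\frac{2T\mu^2}{d}}\right)
&= \left(1 - \frac{1}{\sqrt{2}}\right)\frac{d^2}{16T}
\approx 0.0183\,\frac{d^2}{T} \;\ge\; 0.01\,\frac{d^2}{T}.
\end{align*}
```

With this choice, $\|e\| = \sqrt{d}\,\mu = \sqrt{d^2/4T} \le 1/2$ whenever $T \ge d^2$, so the constraint $\|e\| \le 1$ in the theorem statement is satisfied.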

The theorem above applies to the optimization error for derivative-free SCO. We now turn to deal with the case of 
bandit SCO and regret, showing an $\Omega(\sqrt{d^2/T})$ lower bound. Since the derivative-free SCO bound was $\Theta(d^2/T)$, the
result implies a real gap between what can be obtained in terms of average regret, as opposed to optimization error,
in the case of quadratic functions. This stands in contrast to previously studied settings, such as multi-armed bandits,
where the construction implying the known $\Omega(\sqrt{d/T})$ lower bound (e.g. [9]) applies equally well to derivative-free
and bandit SCO.

Theorem 3. Let the number of rounds T be fixed. Then for any (possibly randomized) querying strategy, there exists
a quadratic function of the form $F(w) = \frac{1}{2}\|w\|^2 - \langle e, w\rangle$, which is minimized at e where $\|e\| \le 1/2$, such that

$$\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T} F(w_t) - F(w^*)\right] \ge 0.02 \min\left\{1, \sqrt{\frac{d^2}{T}}\right\}.$$

Note that our lower bound holds even when the domain is unrestricted (the algorithm can pick any point in $\mathbb{R}^d$).
Moreover, the lower bound coincides (up to a constant) with the $O(\sqrt{d^2/T})$ regret upper-bound shown for
strongly-convex and smooth functions in [2]. This shows that for strongly-convex and smooth functions, the minimax average
regret is $\Theta(\sqrt{d^2/T})$. Also, the lower bound implies that one cannot hope to obtain average regret better than $\sqrt{d^2/T}$
for more general bandit problems, such as strongly-convex or even convex problems.

Proof. The proof relies on techniques similar to the lower bound of Thm. 2, with a key additional insight. Specifically,
in Thm. 2, the lower bound obtained actually depends on the norm of the points $w_1, \ldots, w_T$ (see Eq. (4)), and the
optimal $w^*$ has a very small norm. In a regret minimization setting, the points $w_1, \ldots, w_T$ cannot be too far from $w^*$,
and thus must have a small norm as well, leading to a stronger lower bound than that of Thm. 2.

As in the proof of Thm. 2, we may assume without loss of generality that $T \ge d^2$, and it is enough to show that the
expected average regret is at least $0.02\sqrt{d^2/T}$. This is because if there was a strategy with < 0.02 average regret after
$T < d^2$ rounds, then for the case of $d^2$ rounds, we could just run that strategy for T rounds, compute the average $\bar{w}$
of all points played so far, and then repeatedly choose $\bar{w}$ in the remaining rounds. By Jensen's inequality, this would
imply a < 0.02 average regret after $d^2$ rounds, in contradiction.

Let $\bar{w}$ be an arbitrary deterministic function of $w_1, \ldots, w_T$. A proof identical to that of Thm. 2 up to Eq. (4),
implies that for any $\mu > 0$, there exists a quadratic function of the form



$$F_e(w) = \frac{1}{2}\|w\|^2 - \langle e, w\rangle$$
with $e \in \{-\mu, \mu\}^d$, such that



E[F e (w)-F e (w*)] >E 



dp?_ 



i 



\Tg min { M2 '^p} 



In particular, letting $\bar{w} = \frac{1}{T}\sum_{t=1}^{T} w_t$, using Jensen's inequality, and discarding the min, we get that

T 

T 



E 



l£F e (w t )-F e (w* 



> 



d M 2 



1 



\ t=l 



However, we also know that by strong convexity of F e , we have 

T 1 -i T 



E 



I^F e (w t )-F e (w*) 



Wf - e l 



(5) 



(6) 



t=i 



Using the fact that

$$\|w_t\|^2 = \|w_t - e + e\|^2 \le \left(\|w_t - e\| + \|e\|\right)^2 \le 2\|w_t - e\|^2 + 2\|e\|^2,$$

we get that

$$\|w_t - e\|^2 \ge \frac{1}{2}\|w_t\|^2 - \|e\|^2 = \frac{1}{2}\|w_t\|^2 - d\mu^2.$$



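Substituting into Eq. (6) and slightly manipulating the resulting inequality, we get

$$\mathbb{E}\left[\sum_{t=1}^{T}\|w_t\|^2\right] \le 4T\,\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T} F_e(w_t) - F_e(w^*)\right] + 2Td\mu^2.$$

For simplicity, denote the average regret term $\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T} F_e(w_t) - F_e(w^*)\right]$ by R. Substituting the expression above
into Eq. (5), we get

$$R \ge \frac{d\mu^2}{4}\left(1 - \sqrt{\frac{\mu^2}{d}\left(4TR + 2Td\mu^2\right)}\right) \ge \frac{d\mu^2}{4}\left(1 - \sqrt{\frac{4\mu^2 T R}{d}} - \mu^2\sqrt{2T}\right).$$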



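Rearranging and simplifying, we get

$$R + \frac{\sqrt{dT}}{2}\,\mu^3\sqrt{R} + \frac{d\mu^2}{4}\left(\mu^2\sqrt{2T} - 1\right) \ge 0.$$

The equation above can be seen as a quadratic function of $\sqrt{R}$, with the roots

$$\frac{1}{2}\left(-\frac{\sqrt{dT}}{2}\,\mu^3 \pm \sqrt{\frac{dT}{4}\,\mu^6 + d\mu^2\left(1 - \mu^2\sqrt{2T}\right)}\right).$$

Now, recall that $\mu$ is a free parameter that we can choose at will. If we choose it so that $1 - \mu^2\sqrt{2T} > 0$, then it is
easy to show that we get two roots, one strictly positive and one strictly negative. Since we know $\sqrt{R}$ is a nonnegative
quantity, we get that

$$\sqrt{R} \ge \frac{1}{2}\left(-\frac{\sqrt{dT}}{2}\,\mu^3 + \sqrt{\frac{dT}{4}\,\mu^6 + d\mu^2\left(1 - \mu^2\sqrt{2T}\right)}\right).$$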
Finally, choosing $\mu = T^{-1/4}/2$ (which indeed satisfies $1 - \mu^2\sqrt{2T} > 0$), and simplifying, we get

$$\sqrt{R} \ge 0.17\,\sqrt[4]{d^2/T}.$$

Recalling that R is the expected average regret, it only remains to take the square of the two sides. We note that since
we assume $T \ge d^2$, then $\|e\| = \sqrt{d}\,\mu = \sqrt[4]{d^2/T}/2 \le 1/2$, as specified in the theorem statement. □
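
The constant 0.17 can be checked numerically. The sketch below assumes that the quadratic inequality in $\sqrt{R}$ reads $R + \frac{\sqrt{dT}}{2}\mu^3\sqrt{R} + \frac{d\mu^2}{4}(\mu^2\sqrt{2T} - 1) \ge 0$ and evaluates its positive root at $\mu = T^{-1/4}/2$; the function name and the sample values of d and T are illustrative.

```python
import math


def sqrt_regret_lower_bound(d, T):
    """Positive root of s^2 + p*s - q = 0 at mu = T^(-1/4) / 2."""
    mu = T ** -0.25 / 2.0
    p = mu ** 3 * math.sqrt(d * T) / 2.0
    q = d * mu ** 2 / 4.0 * (1.0 - mu ** 2 * math.sqrt(2.0 * T))
    return (-p + math.sqrt(p * p + 4.0 * q)) / 2.0


for d, T in [(1, 1), (2, 100), (10, 100), (5, 10 ** 6)]:
    ratio = sqrt_regret_lower_bound(d, T) / (d * d / T) ** 0.25
    norm_e = math.sqrt(d) * T ** -0.25 / 2.0   # ||e|| = sqrt(d) * mu
    print(d, T, round(ratio, 4), norm_e <= 0.5)
```

The ratio does not depend on d or T, so squaring it gives the constant in front of $\sqrt{d^2/T}$ in the regret lower bound.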

4 Strongly Convex and Smooth Functions 

We now turn to the more general case of strongly convex and smooth functions. First, we note that in the case of
functions which are both strongly convex and smooth, [2, Theorem 14] already provided an $O(\sqrt{d^2/T})$ average
regret bound (which holds even in a non-stochastic setting). The main result of this section is a matching lower
bound, which holds even if we look at the much easier case of derivative-free SCO. This lower bound implies that
the attainable error for strongly-convex and smooth functions is on the order of $\sqrt{d^2/T}$, and at least $\sqrt{d^2/T}$ for any harder
setting.

Theorem 4. Let the number of rounds T be fixed. Then for any (possibly randomized) querying strategy, there exists
a function F over $\mathbb{R}^d$ which is 0.5-strongly convex and 3.5-smooth; is 4-Lipschitz over the unit Euclidean ball; has a
global minimum in the unit ball; and such that the resulting $\bar{w}$ satisfies

$$\mathbb{E}\left[F(\bar{w}) - F(w^*)\right] \ge 0.004 \min\left\{1, \sqrt{\frac{d^2}{T}}\right\}.$$

Note that we made no attempt to optimize the constant.

Proof. The general proof technique is rather similar to that of Thm. 2, but the construction is a bit more intricate.
Let $\mu > 0$ be a parameter to be determined later. We will look at functions of the form

$$F_e(w) = \|w\|^2 - \sum_{i=1}^{d}\frac{e_i w_i}{1 + (w_i/e_i)^2} \quad (7)$$

where e is uniformly distributed on $\{-\mu, +\mu\}^d$. Our goal will be to prove a lower bound on the expected optimization
error over the randomized choice of $F_e$, with respect to deterministic querying strategies. As explained in the proof
of Thm. 2, this would imply the existence of some fixed $F_e$ such that the expected optimization error over a (possibly
randomized) querying strategy is the same.

Before continuing with the proof, let us explain the intuition behind this choice of $F_{\mathbf{e}}$. For simplicity, let us
consider the one-dimensional case ($d = 1$). Recall that in the quadratic setting, the function we considered (in one
dimension) was of the form

$$F_e(w) = \tfrac{1}{2}w^2 - ew,$$

where $e$ was chosen uniformly at random from $\{-\mu, +\mu\}$, and $\mu$ is a "small" number. Thus, the optimum is at either
$-\mu$ or $\mu$, and the difference $|F_{\mu}(w) - F_{-\mu}(w)|$ at these optima is of order $\mu^2$. However, by picking $w = \Theta(1)$,
the difference $|F_{\mu}(w) - F_{-\mu}(w)|$ is on the order of $\mu$, much larger than the difference close to the optimum, which
is of order $\mu^2$. Therefore, by querying for $w$ far from the optimum, and getting noisy values of $F_e$, it is easier to
distinguish whether we are dealing with $e = +\mu$ or $e = -\mu$, leading to a $d^2/T$ optimization error bound.
In contrast, the function we consider here (in the one-dimensional case) is of the form

$$F_e(w) = w^2 - \frac{ew}{1 + (w/e)^2}. \qquad (8)$$

Figure 1: The two solid blue lines represent $F_e(w)$ as in Eq. (8), for $e = 0.1$ and $e = -0.1$, whereas the two
dashed black lines represent two quadratic functions with similar minimum points. Close to the minima, $F_e(w)$ and
the quadratic functions behave rather similarly. However, as we increase $|w|$, the two quadratic functions become
rather distinguishable, whereas $F_e(w)$ becomes more and more indistinguishable for the two choices of $e$. Thus,
distinguishing whether $e = 0.1$ or $e = -0.1$, based only on function values of $F_e(w)$, is much harder than in the
quadratic case.



This form is carefully designed so that $|F_{\mu}(w) - F_{-\mu}(w)|$ is of order $\mu^2$, not just at the optima of $F_{\mu}$ and $F_{-\mu}$, but for
all $w$. This is because of the additional denominator, which makes the function closer and closer to $w^2$ the larger $|w|$ is
(see Fig. 1 for a graphical illustration).
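The contrast between the two one-dimensional constructions can be checked numerically. The following is a minimal sketch, assuming the quadratic is $\tfrac{1}{2}w^2 - ew$ and using hypothetical helper names (`f_bump`, `f_quad`, `max_gap`) that are not part of the paper; it evaluates the largest gap between the two hypotheses $e = \pm\mu$ over a grid of query points.

```python
def f_bump(w: float, e: float) -> float:
    """One-dimensional construction of Eq. (8): w^2 - e*w / (1 + (w/e)^2)."""
    return w * w - e * w / (1.0 + (w / e) ** 2)


def f_quad(w: float, e: float) -> float:
    """One-dimensional quadratic from the earlier setting (assumed form): w^2/2 - e*w."""
    return 0.5 * w * w - e * w


def max_gap(f, mu: float, grid) -> float:
    """Largest |f(w, mu) - f(w, -mu)| over the grid of query points."""
    return max(abs(f(w, mu) - f(w, -mu)) for w in grid)


mu = 0.1
grid = [i / 1000.0 for i in range(-5000, 5001)]  # query points w in [-5, 5]
gap_bump = max_gap(f_bump, mu, grid)  # never exceeds mu^2
gap_quad = max_gap(f_quad, mu, grid)  # equals 2*mu*|w| at the edge of the grid
print(gap_bump, gap_quad)
```

On this grid the gap for the new construction stays at the level of $\mu^2$, while the quadratic gap grows linearly in $|w|$, which is what makes far-away queries informative in the quadratic case only.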

Formally, the properties we will need are the following: 

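Lemma 4. For any $\mu > 0$ and any $\mathbf{e} \in \{-\mu, \mu\}^d$, the function $F_{\mathbf{e}}$ in Eq. (7) satisfies the following:

• $F_{\mathbf{e}}$ is 0.5-strongly convex and 3.5-smooth.

• $F_{\mathbf{e}}$ is $(2 + \sqrt{2d}\,\mu)$-Lipschitz for any $\mathbf{w}$ such that $\|\mathbf{w}\| \le 1$.

• $F_{\mathbf{e}}$ is globally minimized at $\mathbf{w}^* = c\,\mathbf{e}$, where $c = 0.3489\ldots > 1/3$.

• For any $\mathbf{e}' \in \{-\mu, \mu\}^d$ which differs from $\mathbf{e}$ in a single coordinate, and for any $\mathbf{w} \in \mathbb{R}^d$,
$|F_{\mathbf{e}}(\mathbf{w}) - F_{\mathbf{e}'}(\mathbf{w})| \le \mu^2$.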

The proof is purely technical and appears in the appendix. 

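We now begin to derive the lower bound. Using strong convexity and the lemma, it holds that

$$\mathbb{E}[F(\bar{\mathbf{w}}) - F(\mathbf{w}^*)] \ge \mathbb{E}\left[\frac{1}{4}\|\bar{\mathbf{w}} - \mathbf{w}^*\|^2\right] \ge \frac{1}{4}\,\mathbb{E}\left[\sum_{i=1}^{d} \left(\frac{\mu}{3}\right)^2 \mathbf{1}_{\bar{w}_i e_i < 0}\right] = \frac{\mu^2}{36}\,\mathbb{E}\left[\sum_{i=1}^{d} \mathbf{1}_{\bar{w}_i e_i < 0}\right]. \qquad (9)$$

Here the second inequality holds because $w^*_i = c\,e_i$ with $c > 1/3$: whenever $\bar{w}_i$ and $e_i$ have opposite signs,
$|\bar{w}_i - w^*_i| \ge |w^*_i| \ge \mu/3$.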



We now lower bound this term using Lemma 2. To do so, we need to upper bound the KL divergence of the query
values at round $t$ under the two hypotheses $e_i = +\mu$ and $e_i = -\mu$, the other coordinates being fixed. We assume each
noise term $\xi_{\mathbf{w}}$ is a standard Gaussian random variable. Thus, the query value that we see is distributed as

$$F_{\mathbf{e}}(\mathbf{w}_t) + \xi_{\mathbf{w}_t} = \|\mathbf{w}_t\|^2 - \sum_{j=1}^{d} \frac{e_j w_{t,j}}{1 + (w_{t,j}/e_j)^2} + \xi_{\mathbf{w}_t},$$

where one of the coordinates $i$ of $\mathbf{e}$ is either $+\mu$ or $-\mu$ and the other coordinates are fixed. This is a Gaussian
distribution, with mean $F_{\mathbf{e}}(\mathbf{w}_t)$ and variance 1. By Lemma 4, the difference between the two means under the two
cases $e_i = \mu$, $e_i = -\mu$ is at most $\mu^2$, so by Lemma 3, the KL-divergence is at most $\mu^4/2$. Using Lemma 2, this
implies that Eq. (9) is at least



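$$\frac{d\mu^2}{72}\left(1 - \sqrt{\frac{T\mu^4}{2}}\right).$$

Picking $\mu = T^{-1/4}$, we get a lower bound of $\frac{d}{72\sqrt{T}}\left(1 - \frac{1}{\sqrt{2}}\right) \ge 0.004\sqrt{d^2/T}$.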
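The KL step above relies on the divergence between two unit-variance Gaussians. Lemma 3 is not restated in this section, so the following minimal sketch assumes it is the standard identity $D_{kl}(N(m_1,1)\,\|\,N(m_2,1)) = (m_1 - m_2)^2/2$, and compares that closed form against a plain Riemann sum. The function names are illustrative only.

```python
import math


def gaussian_kl_closed_form(m1: float, m2: float) -> float:
    """KL( N(m1, 1) || N(m2, 1) ) = (m1 - m2)^2 / 2 (standard identity, assumed to be Lemma 3)."""
    return 0.5 * (m1 - m2) ** 2


def gaussian_kl_numeric(m1: float, m2: float, half_width: float = 12.0, n: int = 20_000) -> float:
    """Midpoint Riemann-sum approximation of the same KL divergence."""
    lo = m1 - half_width
    dx = 2.0 * half_width / n
    total = 0.0
    for k in range(n):
        x = lo + (k + 0.5) * dx
        p = math.exp(-0.5 * (x - m1) ** 2) / math.sqrt(2.0 * math.pi)
        # log p(x) - log q(x) for two unit-variance Gaussian densities
        log_ratio = 0.5 * (x - m2) ** 2 - 0.5 * (x - m1) ** 2
        total += p * log_ratio * dx
    return total


mu = 0.3
# If the two means differ by at most mu^2, the KL divergence is at most mu^4 / 2.
kl_bound = gaussian_kl_closed_form(0.0, mu ** 2)
kl_check = gaussian_kl_numeric(0.0, mu ** 2)
print(kl_bound, kl_check)
```

Since the means differ by at most $\mu^2$ for every query point, each round contributes at most $\mu^4/2$ to the divergence regardless of where the query is placed.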

Finally, note that for this choice of $\mu$, by Lemma 4, our function $F_{\mathbf{e}}$ (for any realization of $\mathbf{e}$) is $\left(2 + \sqrt{2d/\sqrt{T}}\right)$-Lipschitz
in the unit ball, and has a global minimum with norm at most $0.35\sqrt{d/\sqrt{T}}$. If $T \ge d^2$, then the Lipschitz
parameter is at most 4 and the global minimum is inside the unit ball, satisfying the requirements in the theorem
statement. If $T < d^2$, then the bound cannot be better than what we would obtain for $T = d^2$ (the argument is similar
to the one in the proof of Thm. 2), which is 0.004. Thus, for any $T$, the bound is at least

$$\min\left\{0.004,\ 0.004\sqrt{\frac{d^2}{T}}\right\} = 0.004 \min\left\{1, \sqrt{\frac{d^2}{T}}\right\}$$

as required. □
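To make the quadratic dependence on the dimension concrete, the bound of Theorem 4 can be inverted to give the number of queries below which a target error is ruled out. This is a minimal sketch; `lower_bound` and `min_queries` are hypothetical helper names, and the constant 0.004 is the (unoptimized) one from the theorem statement.

```python
import math


def lower_bound(d: int, T: int) -> float:
    """Theorem 4 lower bound on the expected error: 0.004 * min(1, sqrt(d^2 / T))."""
    return 0.004 * min(1.0, math.sqrt(d * d / T))


def min_queries(d: int, eps: float) -> int:
    """Smallest T for which the lower bound no longer rules out error eps.

    Only meaningful for 0 < eps < 0.004; solves 0.004 * d / sqrt(T) <= eps.
    """
    if not 0.0 < eps < 0.004:
        raise ValueError("eps must lie in (0, 0.004)")
    return math.ceil((0.004 * d / eps) ** 2)


q10 = min_queries(10, 0.001)
q20 = min_queries(20, 0.001)
print(q10, q20, q20 / q10)  # doubling d roughly quadruples the required queries
```

Doubling $d$ at a fixed target error multiplies the required $T$ by about four, which is the "at least quadratic in the dimension" statement in executable form.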



5 Discussion 

In this paper, we considered the dual settings of bandit and derivative-free stochastic convex optimization. We provided 
a sharp characterization of the attainable performance for strongly-convex and smooth functions. The results also 
provide useful lower-bounds for more general settings. We also considered the case of quadratic functions, showing 
that a "fast" $O(1/T)$ rate is possible in a stochastic setting, even without knowledge of derivatives. Our results
have several qualitative differences compared to previously known results which focus on linear functions, such as 
quadratic dependence on the dimension even for extremely "nice" functions, and a provable gap between the attainable 
performance in bandit optimization and derivative-free optimization. 

Our work leaves open several questions. For example, we have only dealt with bounds which hold in expectation,
and our lower bounds focused on the dependence on $d, T$, where other problem parameters, such as the
Lipschitz constant and strong convexity parameter, are fixed constants. While this follows the setting of previous
works, it does not cover situations where these parameters scale with $d$. Finally, while this paper settles the case
of strongly-convex and smooth functions, we still do not know the attainable performance for general convex
functions, or for the more specific case of strongly-convex (possibly non-smooth) functions. Our $\Omega\left(\sqrt{d^2/T}\right)$
lower bound still holds, but the existing upper bounds are much larger: $\min\left\{\sqrt[4]{d^2/T}, \sqrt{d^{32}/T}\right\}$ for convex functions,
and $\min\left\{\sqrt[3]{d^2/T}, \sqrt{d^{32}/T}\right\}$ for strongly-convex functions (see Table 1). We do not know if the lower bound or
the existing upper bounds are tight. However, it is the current upper bounds which seem less "natural", and we suspect
that they are the ones that can be considerably improved, using new algorithms which remain undiscovered.
A Improved Results for Quadratic Functions

In Sec. 3, we showed a tight $\Theta(d^2/T)$ bound on the achievable error for quadratic functions, in the derivative-free
SCO setting. This was shown under the assumption that the noise $\xi_{\mathbf{w}}$ is zero-mean and has a second moment bounded
by $\max\{1, \|\mathbf{w}\|^2\}$. In this appendix, we show how under additional assumptions on the noise, one can improve on
this result with an efficient algorithm.

To give a concrete example, consider the classic setting of ridge regression, where we have labeled training examples
$(\mathbf{x}, y)$ sampled i.i.d. from some distribution over $\mathbb{R}^d \times \mathbb{R}$, and our goal is to find some $\mathbf{w} \in \mathbb{R}^d$ minimizing

$$F(\mathbf{w}) = \frac{\lambda}{2}\|\mathbf{w}\|^2 + \mathbb{E}_{(\mathbf{x},y)}\left[(\mathbf{w}^\top \mathbf{x} - y)^2\right].$$

In a bandit / derivative-free SCO setting, we can think of each query as giving us the value of

$$\hat{F}(\mathbf{w}) = \frac{\lambda}{2}\|\mathbf{w}\|^2 + (\mathbf{w}^\top \mathbf{x} - y)^2 \qquad (10)$$

for some specific example $(\mathbf{x}, y)$, and note that its expected value (over the random draw of $(\mathbf{x}, y)$) equals $F(\mathbf{w})$.
Thus, it falls within the setting considered in this paper. However, the noise process is not generic, but has a particular
structure. We will show here that one can actually attain an error rate as good as $O(d/T)$ for this problem.
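The query model for ridge regression can be sketched as follows, assuming the regularizer is $\frac{\lambda}{2}\|\mathbf{w}\|^2$ and taking the data distribution to be uniform over a small fixed sample (a hypothetical choice made only so that the expectation can be computed exactly). The names `ridge_objective` and `noisy_query` are illustrative.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def ridge_objective(w, data, lam):
    """F(w) = (lam/2)||w||^2 + average over the examples of (w.x - y)^2."""
    return 0.5 * lam * dot(w, w) + sum((dot(w, x) - y) ** 2 for x, y in data) / len(data)


def noisy_query(w, data, lam, rng):
    """One query as in Eq. (10): the regularized squared loss on a single random example."""
    x, y = rng.choice(data)
    return 0.5 * lam * dot(w, w) + (dot(w, x) - y) ** 2


rng = random.Random(0)
# Hypothetical data: the "distribution" is uniform over 50 fixed examples.
data = [([rng.uniform(-0.5, 0.5) for _ in range(3)], rng.uniform(-1.0, 1.0)) for _ in range(50)]
w, lam = [0.2, -0.1, 0.4], 1.0

exact = ridge_objective(w, data, lam)
estimate = sum(noisy_query(w, data, lam, rng) for _ in range(20000)) / 20000
print(exact, estimate)
```

Each query is an unbiased but noisy evaluation of $F(\mathbf{w})$, and the noise comes entirely from which example was drawn, which is the structure exploited in the rest of this appendix.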

To formally present our result, it would be useful to consider a more general setting, the ridge regression setting
above being a special case. Suppose we can write $F(\mathbf{w})$ as $\mathbb{E}[\hat{F}(\mathbf{w})]$, where $\hat{F}(\mathbf{w})$ decomposes into a deterministic
term $R(\mathbf{w})$ and a stochastic quadratic term $\hat{G}(\mathbf{w})$:

$$\hat{F}(\mathbf{w}) = R(\mathbf{w}) + \hat{G}(\mathbf{w}) = R(\mathbf{w}) + \mathbf{w}^\top A \mathbf{w} + \mathbf{b}^\top \mathbf{w} + c,$$

where $A, \mathbf{b}, c$ are random variables. We assume that whenever we query a point $\mathbf{w}$, we get $\hat{F}(\mathbf{w})$ for some random
realization of $A, \mathbf{b}, c$. In general, $R(\mathbf{w})$ can be a strongly-convex regularization term, such as $\frac{\lambda}{2}\|\mathbf{w}\|^2$ in Eq. (10).

The algorithm we consider, Algorithm 2, is a slight variant of Algorithm 1, which takes this decomposition of
$\hat{F}(\mathbf{w})$ into account when constructing its unbiased gradient estimate. Compared to Algorithm 1, this algorithm also
queries at random points further away from $\mathbf{w}_t$, up to a distance of $\sqrt{d}$. We will assume here that we can always query
at such points.$^3$ We also let $\bar{\mathcal{W}} = \mathcal{W} \cap \{\mathbf{w} : \|\mathbf{w}\| \le B\}$ in the algorithm, where we recall that $B$ is some known upper
bound on $\|\mathbf{w}^*\|$.

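Algorithm 2 Derivative-Free SCO Algorithm for Decomposable-Quadratic Functions
Input: Deterministic term $R(\cdot)$; strong convexity parameter $\lambda > 0$
Initialize $\mathbf{w}_1 = \mathbf{0}$.
for $t = 1, \ldots, T-1$ do
    Pick $\mathbf{r} \in \{-1, +1\}^d$ uniformly at random
    Query noisy function value $v$ at point $\mathbf{w}_t + \mathbf{r}$
    Let $\mathbf{g} = (v - R(\mathbf{w}_t + \mathbf{r}))\,\mathbf{r} + \mathbf{g}_R(\mathbf{w}_t)$, where $\mathbf{g}_R(\mathbf{w})$ is a subgradient of $R(\cdot)$ at $\mathbf{w}$
    Let $\mathbf{w}_{t+1} = \Pi_{\bar{\mathcal{W}}}\left(\mathbf{w}_t - \frac{1}{\lambda t}\mathbf{g}\right)$
end for
Return $\bar{\mathbf{w}} = \frac{2}{T}\sum_{t=T/2}^{T} \mathbf{w}_t$.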
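The steps of Algorithm 2 can be sketched in a few lines. This is an illustrative reading rather than a reference implementation: it assumes a step size of $1/(\lambda t)$, takes the domain to be the Euclidean ball of radius $B$, and averages the iterates with index at least $T/2$. The test problem (basis-vector instances with noise-free labels) is hypothetical and chosen only because its optimum has a closed form.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project_to_ball(w, radius):
    """Euclidean projection onto {w : ||w|| <= radius}."""
    norm = math.sqrt(dot(w, w))
    return list(w) if norm <= radius else [radius * wi / norm for wi in w]


def algorithm2(query, R, grad_R, lam, d, T, B, rng):
    """Sketch of Algorithm 2; returns the average of the iterates w_t with t >= T/2."""
    w = [0.0] * d
    tail_sum, tail_count = [0.0] * d, 0
    for t in range(1, T + 1):
        if t >= T // 2:                      # accumulate w_t for the suffix average
            tail_sum = [s + wi for s, wi in zip(tail_sum, w)]
            tail_count += 1
        if t == T:                           # updates run for t = 1, ..., T-1 only
            break
        r = [rng.choice((-1.0, 1.0)) for _ in range(d)]
        point = [wi + ri for wi, ri in zip(w, r)]
        v = query(point)                     # noisy function value at w_t + r
        scale = v - R(point)                 # strip the known deterministic term
        g = [scale * ri + gi for ri, gi in zip(r, grad_R(w))]
        step = 1.0 / (lam * t)               # assumed step size 1/(lambda * t)
        w = project_to_ball([wi - step * gi for wi, gi in zip(w, g)], B)
    return [s / tail_count for s in tail_sum]


# Hypothetical test problem: ridge regression where x is a random standard basis
# vector and y = x . w_true, so the minimizer of F is known in closed form.
d, lam, B, T = 3, 1.0, 1.0, 20000
w_true = [0.5, -0.3, 0.2]
rng = random.Random(0)


def R(w):
    return 0.5 * lam * dot(w, w)


def grad_R(w):
    return [lam * wi for wi in w]


def query(p):
    i = rng.randrange(d)                     # draws the example (x, y) = (e_i, w_true[i])
    return R(p) + (p[i] - w_true[i]) ** 2


def F(w):
    return R(w) + sum((wi - ti) ** 2 for wi, ti in zip(w, w_true)) / d


w_star = [(2.0 / d) / (lam + 2.0 / d) * ti for ti in w_true]
w_bar = algorithm2(query, R, grad_R, lam, d, T, B, rng)
gap = F(w_bar) - F(w_star)
print(gap)
```

The only information used about $F$ is one noisy function value per round; the known term $R$ is subtracted before forming the estimate so that only the stochastic quadratic part contributes to its variance.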

We now show that with this algorithm, one can improve on our $O(d^2/T)$ error upper bound from Thm. 1.

Theorem 5. In the setting described above, suppose $\|A\|_2, \|\mathbf{b}\|, |c|$ are all at most 1 with probability 1, the optimum
$\mathbf{w}^*$ has a norm of at most $B$, and $\|\mathbf{g}_R(\mathbf{w})\| \le N$ for any $\mathbf{w} \in \mathcal{W}$. Then under Assumption 1, the point $\bar{\mathbf{w}}$ returned by
Algorithm 2 satisfies

$$\mathbb{E}[F(\bar{\mathbf{w}}) - F(\mathbf{w}^*)] \le 4(4 + 5\log(2))\,\frac{N^2 + 3d\left((B+1)^4 + \mathbb{E}\left[\|A\|_F^2\right]\right)}{\lambda T},$$

where $\|\cdot\|_F$ is the Frobenius norm.

$^3$ Similar to Algorithm 1, if one can only query at some distance $\epsilon\sqrt{d}$, where $\epsilon \in (0, 1]$, then one can modify the algorithm to handle such cases,
with the resulting error bound depending on $\epsilon$.

Note that if we only assume $\|A\|_2 \le 1$, then $\|A\|_F^2$ can be as high as $d$, which leads to an $O(d^2/T)$ bound, same
as in Thm. 1. However, it may be much smaller than that. In particular, for the ridge regression case we considered
earlier, $A$ corresponds to $\mathbf{x}\mathbf{x}^\top$ where $\mathbf{x}$ is a randomly drawn instance. Under the common assumption that $\|\mathbf{x}\| \le O(1)$
(independent of the dimension), it follows that $\|\mathbf{x}\mathbf{x}^\top\|_F^2 = \|\mathbf{x}\|^4 = O(1)$. Therefore, $\mathbb{E}[\|A\|_F^2]$ is independent of the
dimension, leading to an $O(d/T)$ error upper bound in terms of $d, T$.
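The gap between the spectral and Frobenius norms discussed above is easy to see on two extreme cases. The following minimal sketch (helper names are illustrative) compares a rank-one matrix $\mathbf{x}\mathbf{x}^\top$ with $\|\mathbf{x}\| = 1$ against the identity matrix, both of which have spectral norm 1.

```python
import math


def frobenius_sq(A):
    """Squared Frobenius norm: the sum of squared entries."""
    return sum(a * a for row in A for a in row)


def outer(x):
    """Rank-one matrix x x^T as nested lists."""
    return [[xi * xj for xj in x] for xi in x]


def identity(d):
    return [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]


d = 50
x = [1.0 / math.sqrt(d)] * d               # a unit-norm instance, ||x|| = 1
fro_rank_one = frobenius_sq(outer(x))      # equals ||x||^4 = 1, whatever d is
fro_identity = frobenius_sq(identity(d))   # equals d, although the spectral norm is 1
print(fro_rank_one, fro_identity)
```

The rank-one case is the one relevant for ridge regression, which is why the dimension dependence improves there.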

We remark that even in this specific setting, the $O(d/T)$ bound does not carry over to the bandit SCO setting (i.e.,
in terms of regret), since the algorithm requires us to query far away from $\mathbf{w}_t$. Also, we again emphasize that this
result does not contradict our lower bound in the quadratic case (Thm. 2), since the setting there included a generic
noise term, while here the stochastic "noise" has a very specific structure.

As to the proof of Thm. 5, it is very similar to that of Thm. 1, the key difference being a better moment upper bound
on the gradient estimate $\mathbf{g}$, as formalized in the following lemma. Plugging this improved bound into the calculations
results in the theorem.

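Lemma 5. For any $\mathbf{w}_t$, we have that $\mathbb{E}_{\mathbf{r},v}[\mathbf{g}]$ is a subgradient of $F$ at $\mathbf{w}_t$, and

$$\mathbb{E}_{\mathbf{r},v}\left[\|\mathbf{g}\|^2\right] \le 4\left(N^2 + 3d\left((B+1)^4 + \mathbb{E}\left[\|A\|_F^2\right]\right)\right).$$

Proof. By definition of $\hat{F}(\mathbf{w}_t)$, we note that

$$\mathbf{g} = \left((\mathbf{w}_t + \mathbf{r})^\top A (\mathbf{w}_t + \mathbf{r}) + \mathbf{b}^\top(\mathbf{w}_t + \mathbf{r}) + c\right)\mathbf{r} + \mathbf{g}_R(\mathbf{w}_t).$$

Using a similar calculation to the one in the proof of Lemma 1, we have that the expected value of this expression over
$\mathbf{r}$ and $A, \mathbf{b}, c$ is

$$2\,\mathbb{E}[A]\,\mathbf{w}_t + \mathbb{E}[\mathbf{b}] + \mathbf{g}_R(\mathbf{w}_t),$$

which is a subgradient of $F$ at $\mathbf{w}_t$. As to the moment bound, we have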



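$$\begin{aligned}
\mathbb{E}\left[\|\mathbf{g}\|^2\right] &\le \mathbb{E}\left[4\left\|\left((\mathbf{w}_t + \mathbf{r})^\top A(\mathbf{w}_t + \mathbf{r})\right)\mathbf{r}\right\|^2 + 4\left\|\left(\mathbf{b}^\top(\mathbf{w}_t + \mathbf{r})\right)\mathbf{r}\right\|^2 + 4c^2\|\mathbf{r}\|^2 + 4\|\mathbf{g}_R(\mathbf{w}_t)\|^2\right] \\
&\le 4d\,\mathbb{E}\left[\left(\|A\|_2\|\mathbf{w}_t\|^2 + 2\mathbf{w}_t^\top A\mathbf{r} + \mathbf{r}^\top A\mathbf{r}\right)^2 + \left(\mathbf{b}^\top\mathbf{w}_t + \mathbf{b}^\top\mathbf{r}\right)^2\right] + 4d + 4N^2 \\
&\le 4d\,\mathbb{E}\left[\left(B^2 + 2\mathbf{w}_t^\top A\mathbf{r} + \mathbf{r}^\top A\mathbf{r}\right)^2 + 2\left(B^2 + (\mathbf{b}^\top\mathbf{r})^2\right)\right] + 4d + 4N^2 \\
&\le 12d\left(B^4 + 4\,\mathbb{E}\left[(\mathbf{w}_t^\top A\mathbf{r})^2\right] + \mathbb{E}\left[(\mathbf{r}^\top A\mathbf{r})^2\right]\right) + 8d\left(B^2 + \mathbb{E}\left[(\mathbf{b}^\top\mathbf{r})^2\right]\right) + 4d + 4N^2. \qquad (11)
\end{aligned}$$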



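Letting $a_{i,j}$ denote entry $(i, j)$ in $A$, and recalling that by definition of $\mathbf{r}$, $\mathbb{E}[r_i r_j] = \mathbf{1}_{i=j}$, we have that

$$\mathbb{E}\left[(\mathbf{r}^\top A\mathbf{r})^2\right] = \mathbb{E}\left[\Big(\sum_{i,j} r_i r_j a_{i,j}\Big)^2\right] = \mathbb{E}\left[\sum_{i,j,i',j'} r_i r_j r_{i'} r_{j'}\, a_{i,j}\, a_{i',j'}\right] = \mathbb{E}\left[\sum_{i,j} r_i^2 r_j^2 a_{i,j}^2\right] = \mathbb{E}\left[\sum_{i,j} a_{i,j}^2\right] = \mathbb{E}\left[\|A\|_F^2\right].$$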



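Also, using the fact that $\mathbb{E}[\mathbf{r}\mathbf{r}^\top]$ is the identity matrix, we have

$$\mathbb{E}\left[(\mathbf{w}_t^\top A\mathbf{r})^2\right] = \mathbb{E}\left[\mathbf{w}_t^\top A\mathbf{r}\mathbf{r}^\top A^\top\mathbf{w}_t\right] = \mathbb{E}\left[\mathbf{w}_t^\top A A^\top\mathbf{w}_t\right] \le \mathbb{E}\left[\|\mathbf{w}_t\|^2\|A\|_2^2\right] \le B^2.$$

Finally, we have

$$\mathbb{E}\left[(\mathbf{b}^\top\mathbf{r})^2\right] = \mathbb{E}\left[\mathbf{b}^\top\mathbf{r}\mathbf{r}^\top\mathbf{b}\right] = \mathbb{E}\left[\|\mathbf{b}\|^2\right] \le 1.$$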



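Plugging these inequalities back into Eq. (11), we get that

$$\begin{aligned}
\mathbb{E}\left[\|\mathbf{g}\|^2\right] &\le 12d\left(B^4 + 4B^2 + \mathbb{E}\left[\|A\|_F^2\right]\right) + 8d(B^2 + 1) + 4d + 4N^2 \\
&= 4d\left(3B^4 + 14B^2 + 3 + 3\,\mathbb{E}\left[\|A\|_F^2\right]\right) + 4N^2 \\
&\le 12d\left((B+1)^4 + \mathbb{E}\left[\|A\|_F^2\right]\right) + 4N^2,
\end{aligned}$$

from which the lemma follows.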
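The unbiasedness claim in Lemma 5 can be verified exactly for small $d$ by enumerating all sign vectors $\mathbf{r}$. The sketch below uses one fixed, arbitrarily chosen symmetric realization of $A, \mathbf{b}, c$ (hypothetical values) and checks that the average of $\hat{G}(\mathbf{w} + \mathbf{r})\,\mathbf{r}$ over $\mathbf{r} \in \{-1,+1\}^d$ equals the gradient $2A\mathbf{w} + \mathbf{b}$ of the quadratic part.

```python
import itertools


def quad_value(A, b, c, w):
    """G(w) = w^T A w + b^T w + c for a d x d matrix A given as nested lists."""
    d = len(w)
    quad = sum(w[i] * A[i][j] * w[j] for i in range(d) for j in range(d))
    return quad + sum(bi * wi for bi, wi in zip(b, w)) + c


def mean_estimate(A, b, c, w):
    """Exact average of G(w + r) * r over all sign vectors r in {-1, +1}^d."""
    d = len(w)
    total = [0.0] * d
    for r in itertools.product((-1.0, 1.0), repeat=d):
        value = quad_value(A, b, c, [wi + ri for wi, ri in zip(w, r)])
        total = [ti + value * ri for ti, ri in zip(total, r)]
    return [ti / 2 ** d for ti in total]


A = [[0.5, 0.1, 0.0], [0.1, 0.3, -0.2], [0.0, -0.2, 0.4]]  # symmetric, hypothetical values
b = [0.2, -0.1, 0.3]
c = 0.7
w = [0.4, -0.6, 0.1]

estimate = mean_estimate(A, b, c, w)
gradient = [2 * sum(A[i][j] * w[j] for j in range(3)) + b[i] for i in range(3)]
print(estimate, gradient)
```

The constant and pure second-order terms in $\mathbf{r}$ average out because the odd moments of the random signs vanish, leaving only the gradient.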

B Technical Proofs
B.1 Proof of Lemma 2

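We have the following:

$$\begin{aligned}
\mathbb{E}\left[\sum_{i=1}^{d} \mathbf{1}_{\bar{w}_i e_i < 0}\right] &= \sum_{i=1}^{d} \Pr(\bar{w}_i e_i < 0) \\
&= \frac{1}{2}\sum_{i=1}^{d}\left(\Pr(\bar{w}_i < 0 \mid e_i > 0) + \Pr(\bar{w}_i > 0 \mid e_i < 0)\right) \\
&= \frac{1}{2}\left[d - \sum_{i=1}^{d}\left(\Pr(\bar{w}_i > 0 \mid e_i > 0) - \Pr(\bar{w}_i > 0 \mid e_i < 0)\right)\right] \\
&\ge \frac{d}{2}\left(1 - \sqrt{\frac{1}{d}\sum_{i=1}^{d}\left(\Pr(\bar{w}_i > 0 \mid e_i > 0) - \Pr(\bar{w}_i > 0 \mid e_i < 0)\right)^2}\right), \qquad (12)
\end{aligned}$$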



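where the last inequality is by the fact that for any values $a_1, \ldots, a_d$, it holds that $|a_1| + \ldots + |a_d| \le \sqrt{d}\sqrt{a_1^2 + \cdots + a_d^2}$.
Consider (without loss of generality) the term corresponding to the first coordinate, namely

$$\left(\Pr(\bar{w}_1 > 0 \mid e_1 > 0) - \Pr(\bar{w}_1 > 0 \mid e_1 < 0)\right)^2.$$

This term equals

$$\begin{aligned}
&\left(\sum_{e_2,\ldots,e_d} \Pr\left(\{e_j\}_{j=2}^{d}\right)\left(\Pr\left(\bar{w}_1 > 0 \mid e_1 > 0, \{e_j\}_{j=2}^{d}\right) - \Pr\left(\bar{w}_1 > 0 \mid e_1 < 0, \{e_j\}_{j=2}^{d}\right)\right)\right)^2 \\
&\quad\le \sum_{e_2,\ldots,e_d} \Pr\left(\{e_j\}_{j=2}^{d}\right)\left(\Pr\left(\bar{w}_1 > 0 \mid e_1 > 0, \{e_j\}_{j=2}^{d}\right) - \Pr\left(\bar{w}_1 > 0 \mid e_1 < 0, \{e_j\}_{j=2}^{d}\right)\right)^2 \\
&\quad\le \sup_{e_2,\ldots,e_d}\left(\Pr\left(\bar{w}_1 > 0 \mid e_1 > 0, \{e_j\}_{j=2}^{d}\right) - \Pr\left(\bar{w}_1 > 0 \mid e_1 < 0, \{e_j\}_{j=2}^{d}\right)\right)^2.
\end{aligned}$$

By Pinsker's inequality and the assumption that $\bar{\mathbf{w}}$ is a deterministic function of $v_1, \ldots, v_T$, for any fixed $e_2, \ldots, e_d$ the
squared difference above is at most

$$\frac{1}{2} D_{kl}\left(\Pr\left(v_1, \ldots, v_T \mid e_1 > 0, \{e_j\}_{j=2}^{d}\right) \,\Big\|\, \Pr\left(v_1, \ldots, v_T \mid e_1 < 0, \{e_j\}_{j=2}^{d}\right)\right),$$

where $D_{kl}(P\|Q)$ is the Kullback-Leibler divergence between the two distributions. By the chain rule (see e.g. [10]),
we can upper bound the above by

$$\frac{1}{2}\sum_{t=1}^{T} D_{kl}\left(\Pr\left(v_t \mid e_1 > 0, \{e_j\}_{j=2}^{d}, \{v_l\}_{l=1}^{t-1}\right) \,\Big\|\, \Pr\left(v_t \mid e_1 < 0, \{e_j\}_{j=2}^{d}, \{v_l\}_{l=1}^{t-1}\right)\right).$$

Plugging these bounds back into Eq. (12), the result follows.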
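The use of Pinsker's inequality above is for a single event (here, $\bar{w}_1 > 0$), where it says that the squared difference of the event's probabilities under two distributions is at most half their KL divergence. As a minimal sanity check in the simplest case, the sketch below verifies $(p - q)^2 \le \frac{1}{2} D_{kl}(\mathrm{Bernoulli}(p)\,\|\,\mathrm{Bernoulli}(q))$ on a grid; the helper names are illustrative.

```python
import math


def kl_bernoulli(p: float, q: float) -> float:
    """KL divergence between Bernoulli(p) and Bernoulli(q), for p, q in (0, 1)."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def pinsker_gap(p: float, q: float) -> float:
    """KL/2 minus (p - q)^2; Pinsker's inequality says this is nonnegative."""
    return 0.5 * kl_bernoulli(p, q) - (p - q) ** 2


grid = [k / 50.0 for k in range(1, 50)]
worst = min(pinsker_gap(p, q) for p in grid for q in grid)
print(worst)  # the smallest gap on the grid, attained when p == q
```

In the proof, the two distributions are those of the query values $v_1, \ldots, v_T$ under the two signs of $e_1$; since $\bar{\mathbf{w}}$ is a function of these values, the same bound applies to the event $\bar{w}_1 > 0$.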

B.2 Proof of Lemma 4

Note that we can write the function $F_{\mathbf{e}}(\mathbf{w})$ as $\sum_{i=1}^{d} g_{e_i}(w_i)$, where

$$g_a(x) = x^2 - \frac{ax}{1 + (x/a)^2}.$$

It is not hard to see that to prove the lemma, it is enough to prove that:

1. $g_a(x)$ is 0.5-strongly convex and 3.5-smooth;

2. $|g_a'(x)|$ is at most $2|x| + |a|$;$^4$

3. For all $\mu$, $|g_{\mu}(x) - g_{-\mu}(x)| \le \mu^2$;

4. $g_a(x)$ is minimized at $ca$ where $c = 0.3489\ldots$.

To show item 1, we calculate the second derivative of $g_a(x)$, which is

$$2\left(1 + \frac{a^3 x(3a^2 - x^2)}{(a^2 + x^2)^3}\right).$$

By definition of strong convexity and smoothness, it is enough to show that this term is always at least 0.5 and at most
3.5. Substituting $x = ay$ and simplifying, we get

$$2\left(1 + \frac{y(3 - y^2)}{(1 + y^2)^3}\right).$$

It is a straightforward exercise to verify that $\left|\frac{y(3 - y^2)}{(1 + y^2)^3}\right|$ is at most 3/4 for all $y \in \mathbb{R}$, hence the expression above is
always in [0.5, 3.5] as required.

As to item 2, we note that

$$g_a'(x) = 2x - \frac{a^3(a^2 - x^2)}{(a^2 + x^2)^2} = 2x - a\,\frac{1 - (x/a)^2}{\left(1 + (x/a)^2\right)^2}.$$

For any value of $x/a$, the absolute value of the fraction above is easily verified to be at most 1, hence we can upper bound
$|g_a'(x)|$ by $2|x| + |a|$ as required.
As to item 3, we have

$$|g_{\mu}(x) - g_{-\mu}(x)| = \left|\frac{2\mu x}{1 + (x/\mu)^2}\right| = \mu^2\,\frac{2|\mu x|}{\mu^2 + x^2} \le \mu^2,$$

where the last step uses $\mu^2 + x^2 \ge 2|\mu x|$, which follows from the inequality $(|\mu| - |x|)^2 \ge 0$.
Finally, as to item 4, we note that this function can be equivalently written as

$$g_a(x) = a^2\left((x/a)^2 - \frac{x/a}{1 + (x/a)^2}\right).$$

Substituting $x = ay$, we get $a^2\left(y^2 - y/(1 + y^2)\right)$. A numerical calculation reveals that the minimizing value of $y$ is
0.3489..., hence the minimizing value of $x$ is $0.3489\ldots \cdot a$ as required.
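The "straightforward exercise" and the "numerical calculation" in items 1 and 4 can be reproduced with a few lines. This is a minimal sketch: it scans the second derivative on a finite grid of $y$ values (so it illustrates rather than proves the bounds) and finds the minimizer of $y^2 - y/(1+y^2)$ by bisection on its derivative. The function names are illustrative.

```python
def g(x: float, a: float) -> float:
    """g_a(x) = x^2 - a*x / (1 + (x/a)^2)."""
    return x * x - a * x / (1.0 + (x / a) ** 2)


def g_second(y: float) -> float:
    """Second derivative of g_a at x = a*y: 2 * (1 + y(3 - y^2) / (1 + y^2)^3)."""
    return 2.0 * (1.0 + y * (3.0 - y * y) / (1.0 + y * y) ** 3)


def g_prime_scaled(y: float) -> float:
    """g_a'(a*y) / a = 2y - (1 - y^2) / (1 + y^2)^2."""
    return 2.0 * y - (1.0 - y * y) / (1.0 + y * y) ** 2


ys = [k / 1000.0 for k in range(-20000, 20001)]  # y in [-20, 20]
curv_min = min(g_second(y) for y in ys)
curv_max = max(g_second(y) for y in ys)

# Bisection for the root of the derivative on [0, 1]; it is increasing since g'' > 0.
lo, hi = 0.0, 1.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if g_prime_scaled(mid) < 0.0:
        lo = mid
    else:
        hi = mid
c = 0.5 * (lo + hi)
print(curv_min, curv_max, c)
```

The curvature stays strictly inside $[0.5, 3.5]$ on the grid, and the root of the derivative agrees with the constant $c = 0.3489\ldots$ quoted in the lemma.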



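$^4$ Since this would imply that $\|\nabla F_{\mathbf{e}}(\mathbf{w})\|$ is at most $\sqrt{\sum_{i=1}^{d}(2|w_i| + \mu)^2} \le \sqrt{\sum_{i=1}^{d} 4w_i^2} + \sqrt{\sum_{i=1}^{d}\mu^2} = 2\|\mathbf{w}\| + \sqrt{d}\,\mu \le 2\|\mathbf{w}\| + \sqrt{2d}\,\mu$,
which is at most $2 + \sqrt{2d}\,\mu$ for any $\mathbf{w}$ in the unit ball.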